Chapter 13: Multi-layer Perceptrons

13.2 Batch normalization

  • when dealing with linear supervised learning we saw how normalizing each input feature of a dataset significantly aids in parameter tuning by improving the shape of a cost function's contours (making them more 'circular')
  • in other words, we normalized every distribution that touches a system parameter - which in the linear case consists of the distribution of each input feature
  • the idea of normalizing parameter touching distributions carries over completely from the linear learning scenario to our current situation - where we are conducting nonlinear learning via multilayer perceptrons
  • the difference here is that now we have many more parameters (in comparison to the linear case) and many of these parameters are internal as opposed to weights in a linear combination
  • here we will need to normalize the output of each and every network activation
  • moreover, since these activation distributions naturally change during parameter tuning - e.g., whenever a gradient descent step is made - we must normalize these internal distributions every time we make a parameter update
  • this leads to the incorporation of a normalization step grafted directly onto the architecture of the multilayer perceptron itself - one invoked every time the weights are changed
  • this natural extension of input normalization is popularly referred to as batch normalization.

13.2.1 The stable weight touching distributions of a linear model

When discussing linear model based learning in Chapters 8 - 11 we employed the generic linear model

\begin{equation} \text{model}\left(\mathbf{x},\mathbf{w}\right) = w_0 + x_1w_1 + \cdots + x_Nw_N \end{equation}

for both regression and classification.

  • when tuning these weights via the minimization of any cost function over a dataset of $P$ points $\left\{\mathbf{x}_p,y_p\right\}_{p=1}^P$ we can see how the $n^{th}$ dimension of each input point $x_{p,n}$ touches the $n^{th}$ weight $w_n$
  • performing standard normalization along the $n^{th}$ input feature means making the replacement
\begin{equation} x_{p,n} \longleftarrow \frac{x_{p,n} - \mu_{n}}{\sigma_{n}} \end{equation}

where $\mu_n$ and $\sigma_n$ are the mean and standard deviation along the $n^{th}$ feature of the input, respectively

  • of course once normalized these input distributions $\left\{x_{p,n} \right\}_{p=1}^P$ for each $n$ never change again - they remain stable during training regardless of how we set the parameters of our model.
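As a quick check, the replacement above can be sketched in numpy - a minimal sketch using a hypothetical toy dataset, with one row per input feature and one column per point (the convention used in the implementation later in this Section):

```python
import numpy as np

# hypothetical toy dataset: N = 2 input features, P = 5 points, one column per point
x = np.array([[1., 2., 3., 4., 5.],
              [10., 20., 30., 40., 50.]])

# mean and standard deviation along each input feature (each row)
mu = np.mean(x, axis=1)[:, np.newaxis]
sigma = np.std(x, axis=1)[:, np.newaxis]

# the replacement above: each feature now has zero mean and unit deviation
x_normalized = (x - mu) / sigma

print(np.mean(x_normalized, axis=1))   # ~[0. 0.]
print(np.std(x_normalized, axis=1))    # ~[1. 1.]
```

Once computed on the training data, these per-feature statistics stay fixed - which is precisely the stability described above.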

13.2.2 Batch normalized single layer perceptron units

  • now we employ a linear combination of $B$ multilayer perceptron feature transformations in a nonlinear model
\begin{equation} \text{model}\left(\mathbf{x},\mathbf{w}\right) = w_0 + f_1\left(\mathbf{x}\right)w_1 + \cdots + f_B\left(\mathbf{x}\right)w_B \end{equation}
  • studying a simple example here we can see that standard normalization of the input tempers the contours of a cost function only along the weights internal to the first layer of a multilayer perceptron
  • these are the weights touched by the distribution of each input dimension.
  • e.g., say we use a single hidden layer of $B$ perceptron units, the $b^{th}$ of which looks like
\begin{equation} f^{(1)}_b\left(\mathbf{x}\right)=a\left(w^{\left(1\right)}_{0,\,b}+\underset{n=1}{\overset{N}{\sum}}{w^{\left(1\right)}_{n,\,b}\,x_n}\right) \end{equation}

  • so our model takes the form
\begin{equation} \text{model}\left(\mathbf{x},\mathbf{w}\right) = w_0 + f^{(1)}_1\left(\mathbf{x}\right)w_1 + \cdots + f^{(1)}_B\left(\mathbf{x}\right)w_B. \end{equation}
  • here we see the data input does not touch the weights of the linear combination $w_1,\,w_2,...,w_B$ (as was the case with our linear model), but rather the internal weights of each perceptron
  • so how to temper the contours of the cost function along the weights of the linear combination $w_1,\,w_2,...,w_B$? Which data distributions touch them?
  • a glance at our model above shows that it is the distribution of each perceptron / activation output over the input data
  • in other words, the distribution $\left\{f^{(1)}_b\left(\mathbf{x}_p\right) \right\}_{p=1}^P$ touches the weight $w_b$ of the linear combination
  • in complete analogy to how we have normalized the input data we could perform standard normalization on each of these activation output distributions
  • to do this we normalize each activation output by mean centering and re-scaling by its corresponding standard deviation as
\begin{equation} f_b^{(1)} \left(\mathbf{x}_p \right) \longleftarrow \frac{f_b^{(1)} \left(\mathbf{x}_p\right) - \mu_{f_b^{(1)}}}{\sigma_{f_b^{(1)}}} \end{equation}

where

\begin{array}{c} \mu_{f_b^{(1)}} = \frac{1}{P}\sum_{p=1}^{P}f_b^{(1)}\left(\mathbf{x}_p \right) \\ \sigma_{f_b^{(1)}} = \sqrt{\frac{1}{P}\sum_{p=1}^{P}\left(f_b^{(1)}\left(\mathbf{x}_p \right) - \mu_{f_b^{(1)}} \right)^2} \end{array}
  • will this help? we can try a few experiments to see
  • however note that unlike the distribution of the input data - the distribution of each of these activation outputs can shift every time the internal parameters of our system are changed
  • like e.g., when we take a gradient descent step
  • this is called internal covariate shift
  • so, since the weights of our model change during optimization, keeping the activation output distributions normalized requires that we re-normalize them at every step of parameter tuning (e.g., at every gradient descent step)
  • to do this we can simply build a standard normalization step directly into the perceptron architecture itself
  • thinking about a multilayer perceptron, precisely the same logic leads to the notion of standard normalizing every distribution of activation outputs
  • this is called batch normalization of a multilayer perceptron, and typically speeds up optimization considerably
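The normalization equations above can be sketched directly for a single hidden layer - a minimal sketch assuming a relu activation and hypothetical random weights, where W1 holds the internal weights with its first row the biases:

```python
import numpy as np

np.random.seed(0)
N, B, P = 2, 3, 50                           # input dims, units, data points
X = np.random.randn(N, P)                    # toy input, one column per point
W1 = np.random.randn(N + 1, B)               # internal weights; row 0 holds biases

# activation outputs f_b^{(1)}(x_p) for every unit b and point p
F = np.maximum(0, W1[0][:, np.newaxis] + np.dot(W1[1:].T, X))    # relu, shape (B, P)

# per-unit mean and standard deviation over the P points
mu = np.mean(F, axis=1)[:, np.newaxis]
sigma = np.maximum(np.std(F, axis=1), 10**(-8))[:, np.newaxis]   # guard zero deviations

# standard normalize each activation output distribution
F_normalized = (F - mu) / sigma
```

Each row of F_normalized now has zero mean (and, for any unit that is not entirely inactive, unit deviation) - but only for this particular setting of W1; change the weights and the statistics must be recomputed.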

13.2.4 A Python implementation of batch normalization

  • our standard multilayer perceptron
In [2]:
# fully evaluate our network features using the tensor of weights in w
def feature_transforms(a, w):
    # loop through each layer matrix
    for W in w:
        # pad with ones (to compactly take care of bias) for next layer computation
        o = np.ones((1, np.shape(a)[1]))
        a = np.vstack((o, a))

        # compute linear combination of current layer units
        a = np.dot(a.T, W).T

        # pass through activation
        a = activation(a)
    return a
  • to standard normalize the activation outputs we simply need to add a few lines to the end of the for loop that subtract off the mean of a along each of its units and divide off the associated standard deviations (provided they are non-zero)
  • we have already developed a simple module for performing this computation in Section 8.4, and give a short version of the standard_normalizer function here
In [3]:
# standard normalization function 
def standard_normalizer(x):
    # compute the mean and standard deviation of the input
    x_means = np.mean(x,axis = 1)[:,np.newaxis]
    x_stds = np.std(x,axis = 1)[:,np.newaxis]   

    # check to make sure that x_stds > small threshold; for those that are not,
    # divide by 1 instead of the original standard deviation
    ind = np.argwhere(x_stds < 10**(-2))
    if len(ind) > 0:
        ind = [v[0] for v in ind]
        adjust = np.zeros((x_stds.shape))
        adjust[ind] = 1.0
        x_stds += adjust

    # create standard normalizer function
    normalizer = lambda data: (data - x_means)/x_stds

    # return normalizer 
    return normalizer
  • we call this simple adjusted architecture feature_transforms_batch_normalized
In [4]:
# a multilayer perceptron network with activation output normalization;
# note the input w is a tensor of weights
def feature_transforms_batch_normalized(a, w):
    # loop through each layer matrix
    for W in w:
        # pad with ones (to compactly take care of bias) for next layer computation
        o = np.ones((1, np.shape(a)[1]))
        a = np.vstack((o, a))

        # compute linear combination of current layer units
        a = np.dot(a.T, W).T

        # pass through activation
        a = activation(a)

        # NEW - perform standard normalization on the activation outputs
        normalizer = standard_normalizer(a)
        a = normalizer(a)
    return a
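As a quick sanity check, the adjusted architecture can be exercised end to end - a self-contained sketch with minimal stand-ins for activation and standard_normalizer (assuming a relu activation; the layer dimensions are hypothetical), verifying that every unit's output distribution comes out mean-centered:

```python
import numpy as np

# minimal stand-ins so the sketch runs on its own (defined earlier in the notebook)
activation = lambda t: np.maximum(0, t)          # assuming a relu activation

def standard_normalizer(x):
    x_means = np.mean(x, axis=1)[:, np.newaxis]
    x_stds = np.std(x, axis=1)[:, np.newaxis]
    x_stds[x_stds < 10**(-2)] += 1.0             # guard near-zero deviations
    return lambda data: (data - x_means) / x_stds

def feature_transforms_batch_normalized(a, w):
    for W in w:
        a = np.vstack((np.ones((1, a.shape[1])), a))
        a = activation(np.dot(a.T, W).T)
        a = standard_normalizer(a)(a)            # normalize each unit's outputs
    return a

np.random.seed(1)
a = np.random.randn(2, 100)                          # toy 2-d input, one column per point
w = [np.random.randn(3, 4), np.random.randn(5, 3)]   # two hypothetical layer matrices

f = feature_transforms_batch_normalized(a, w)
print(np.mean(f, axis=1))                            # each unit's mean is ~0
```

Since the normalizer is re-fit inside the loop, the output distributions stay normalized no matter how the weights in w are set - which is exactly the behavior we want during parameter tuning.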

13.2.5 Examples illustrating covariate shift and the benefit of batch normalization

Example 1. The shifting distributions / internal covariate shift of a single layer perceptron

  • here we illustrate the covariate shift of a single layer perceptron with two relu units $f^{(1)}_1$ and $ f^{(1)}_2$
\begin{equation} \text{model}\left(\mathbf{x},\mathbf{w}\right) = w_0 + f^{(1)}_1\left(\mathbf{x}\right)w_1 + f^{(1)}_2\left(\mathbf{x}\right)w_2 \end{equation}
In [309]:
  • run $10,000$ steps of gradient descent to minimize the softmax cost using this single layer network, where we standard normalize the input data
In [310]:
  • below we animate how the distribution $\left\{f^{(1)}_1\left(\mathbf{x}_p\right),\,f^{(1)}_2\left(\mathbf{x}_p\right) \right\}_{p=1}^P$ changes over each step of gradient descent
In [311]:
Out[311]: [animation: the distribution of activation outputs over the steps of gradient descent, controlled by a slider]
  • As you can see by moving the slider around, the distributions of activation outputs - i.e., the distributions touching the weights $w_1$ and $w_2$ of our model's linear combination - change dramatically as the gradient descent algorithm progresses.
  • We can intuit (from our previous discussions on input normalization) that this sort of shifting distribution negatively affects the speed at which gradient descent can properly minimize our cost function.
  • Now we repeat the above experiment using the batch normalized single layer perceptron - making a run of $10,000$ gradient descent steps using the same initialization used above
In [312]:
Out[312]: [animation: the batch normalized activation output distributions over the steps of gradient descent]
Example 2. The shifting distributions / covariate shift of a multilayer perceptron

In this example we illustrate the covariate shift of a standard $4$ layer multilayer perceptron with two units per layer, using the relu activation and the same dataset employed in the previous example.

In [410]:

Each layer's output distribution is shown in this panel: the output of the first layer $\left(f_1^{(1)},f_2^{(1)}\right)$ is colored cyan, the second layer $\left(f_1^{(2)},f_2^{(2)}\right)$ magenta, the third layer $\left(f_1^{(3)},f_2^{(3)}\right)$ lime green, and the fourth layer $\left(f_1^{(4)},f_2^{(4)}\right)$ orange. In analogy to the animation shown above for a single layer network, the horizontal and vertical coordinates of each point represent the activation output of the first and second unit of each layer, respectively.

In [412]:
Out[412]: [animation: each layer's activation output distributions over the steps of gradient descent]
Performing batch normalization on each layer of this network helps considerably in taming this covariate shift. Below we run the same experiment - with the same initialization, activation, and dataset - using the batch normalized version of the network. Afterwards we again animate the covariate shift for a subset of the steps of gradient descent.

In [413]:

Moving the slider from left to right below progresses the animation from the start to finish of the run. Scanning over the entire range of steps we can see in the left panel that the distribution of each layer's activation outputs remains much more stable than previously.

In [415]:
Out[415]: [animation: the batch normalized layer output distributions over the steps of gradient descent]
Example 3. The benefit of batch normalization in speeding up optimization

In this example we illustrate the benefit of batch normalization in terms of speeding up optimization via gradient descent on a dataset of $10,000$ handwritten digits from the MNIST dataset. Each image in this dataset has been contrast normalized, a common preprocessing step for image datasets that we discuss later in the context of convolutional networks. Here we show $1000$ steps of gradient descent, with the largest steplength of the form $10^{-\gamma}$ (for integer $\gamma$) we found to produce adequate convergence, comparing the standard and batch normalized versions of a network with relu activation and a three layer architecture with $10$ units per layer. We can see that both in terms of cost function value and number of misclassifications, the batch normalized version of the perceptron allows for much more rapid minimization via gradient descent than the original version.

In [303]:

13.2.6 Evaluating test points using a batch normalized network

An important point to remember when employing a batch normalized network - which we encountered earlier in e.g., Sections 8.4 and 9.4 when introducing standard normalization of input data - is that we must treat test data precisely as we treat training data. Here this means that every normalization computed on the training data - the various means and standard deviations of the input as well as of each layer's activation output - must be used in the evaluation of new test points as well. In other words, all normalization constants in a batch normalized network should be fixed to the values computed on the training data (at the best step of gradient descent) when evaluating new test points.

In order to properly evaluate test points with our normalized architecture they must be normalized with respect to the same network statistics (i.e., the same input and activation output distribution normalizations) used for the training data.
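This idea can be sketched by caching each layer's normalizer during the training pass and re-using it on test points - a minimal, self-contained sketch with stand-ins for activation and standard_normalizer (assuming relu; the helper name forward and all weight dimensions are hypothetical):

```python
import numpy as np

# minimal stand-ins so the sketch runs on its own (the notebook defines these)
activation = lambda t: np.maximum(0, t)              # assuming a relu activation

def standard_normalizer(x):
    x_means = np.mean(x, axis=1)[:, np.newaxis]
    x_stds = np.std(x, axis=1)[:, np.newaxis]
    x_stds[x_stds < 10**(-2)] += 1.0                 # guard near-zero deviations
    return lambda data: (data - x_means) / x_stds

def forward(a, w, normalizers=None):
    # if normalizers is None we are training: fit one normalizer per layer and cache it;
    # otherwise we are evaluating test points: re-use the cached (training) normalizers
    fitted = []
    for i, W in enumerate(w):
        a = np.vstack((np.ones((1, a.shape[1])), a))
        a = activation(np.dot(a.T, W).T)
        normalizer = standard_normalizer(a) if normalizers is None else normalizers[i]
        fitted.append(normalizer)
        a = normalizer(a)
    return a, fitted

np.random.seed(2)
w = [np.random.randn(3, 4), np.random.randn(5, 3)]   # hypothetical trained weights
x_train = np.random.randn(2, 200)
x_test = np.random.randn(2, 10)

f_train, normalizers = forward(x_train, w)           # statistics fit on training data
f_test, _ = forward(x_test, w, normalizers)          # fixed training statistics re-used
```

Note that the test pass never computes its own means or deviations - doing so would normalize test points against the wrong distributions and produce inconsistent feature values.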